In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion, in which the distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interplay between the hyperparameters of optimization, the structure of the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase-space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
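For reference, the underdamped Langevin form mentioned above can be written in its standard phase-space form; this is the generic equation, and the precise mapping of SGD's learning rate, momentum, and batch size onto the friction coefficient $\gamma$ and noise amplitude is left to the paper's derivation:
\[
d\theta_t = v_t\,dt, \qquad
dv_t = -\gamma\, v_t\,dt - \nabla L(\theta_t)\,dt + \sqrt{2\gamma T}\,dW_t,
\]
where $\theta_t$ denotes the parameters, $v_t$ their instantaneous velocities, $L$ the loss, and $W_t$ a Wiener process; the Fokker-Planck analysis then describes the evolution of the joint density over $(\theta, v)$.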
What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.
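For concreteness, the contrastive objective underlying these experiments can be sketched as a standard InfoNCE loss over two stochastically augmented views. This is a minimal illustrative sketch, not the authors' code; the `encoder` and `augment` functions (e.g. a Viewmaker network or expert-designed transforms) are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss where z1[i] and z2[i] are embeddings of two views of example i."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                      # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage (hypothetical): z1 = encoder(augment(x)); z2 = encoder(augment(x))
# loss = info_nce(z1, z2)
```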
This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our algorithm predicts the relative rotation and translation between the two respective cameras. Despite recent progress in the field, current deep-based methods exhibit only limited generalization to scenes not seen in training. Our approach introduces a network architecture that extracts a grid of coarse features for each input image using the pre-trained LoFTR network. It subsequently relates corresponding features in the two images, and finally uses a convolutional network to recover the relative rotation and translation between the respective cameras. Our experiments indicate that the proposed architecture can generalize to novel scenes, obtaining higher accuracy than existing deep-learning-based methods in various settings and datasets, in particular with limited training data.
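To make the described pipeline concrete, a generic relative-pose regression head of this kind might look as follows. This is an illustrative sketch under the assumption that a frozen backbone (e.g. pre-trained LoFTR) has already produced coarse feature grids `f0` and `f1` for the two images; it is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPoseHead(nn.Module):
    """Regresses relative rotation (unit quaternion) and translation from two feature grids."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, 7)  # 4 quaternion components + 3 translation components

    def forward(self, f0, f1):       # f0, f1: (B, C, H, W) coarse feature grids
        x = self.conv(torch.cat([f0, f1], dim=1)).flatten(1)
        out = self.fc(x)
        q = F.normalize(out[:, :4], dim=1)   # unit quaternion for the relative rotation
        t = out[:, 4:]                       # relative translation (up to scale)
        return q, t
```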
A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear and one non-linear. We study the linear classification setting of Nagarajan and Kolter, as well as a quadratic ground-truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that, above a certain signal-to-noise threshold, any near-max-margin classifier achieves almost no test loss in these two settings. Our results show that being near the max-margin matters: while any model achieving at least a $(1-\epsilon)$-fraction of the max-margin generalizes well, a classifier achieving only half of the max-margin can fail badly. We also strengthen the UC impossibility results of Nagarajan and Kolter, proving that one-sided UC bounds and classical margin bounds fail on near-max-margin classifiers. Our analysis provides insight into why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are also present.
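For readers unfamiliar with the terminology, the normalized margin and the near-max-margin condition referred to above can be written as follows for a linear classifier; the exact normalization used in the paper may differ, so treat this as the standard textbook definition rather than the paper's:
\[
\gamma(w) \;=\; \min_{i} \frac{y_i \,\langle w, x_i\rangle}{\|w\|_2},
\qquad
\gamma(\hat{w}) \;\ge\; (1-\epsilon)\,\max_{w} \gamma(w),
\]
i.e. a classifier $\hat{w}$ is near-max-margin if its worst-case normalized margin on the training set is within a $(1-\epsilon)$ factor of the best achievable margin.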
Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has so far remained undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is unclear whether the existing analysis captures the full capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, showing that the existing FedAvg upper-bound analysis is not improvable. Additionally, we establish a lower bound in the heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness we prove more optimistic, state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, defined as the deviation of the expected SGD trajectory from the noiseless gradient-descent trajectory started from the same initialization. We prove novel sharp bounds on this quantity and show intuitively how it can be analyzed from a stochastic differential equation (SDE) perspective.
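As a reference point, one communication round of the FedAvg / Local SGD scheme analyzed above can be sketched as follows. The sketch is illustrative rather than the paper's experimental setup, and `grad_fn` is a hypothetical per-client stochastic gradient oracle.

```python
import numpy as np

def fedavg_round(w, grad_fn, num_clients=10, local_steps=5, lr=0.1, rng=None):
    """One round: each client runs local SGD from the shared iterate, then the server averages."""
    rng = rng or np.random.default_rng(0)
    local_models = []
    for m in range(num_clients):
        w_m = w.copy()
        for _ in range(local_steps):
            w_m = w_m - lr * grad_fn(w_m, m, rng)  # local stochastic gradient step on client m
        local_models.append(w_m)
    return np.mean(local_models, axis=0)           # server aggregates by parameter averaging
```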